Framing the problem¶
The goal of this report is to assist a local car dealership in improving their business by accurately predicting car prices. By using historical car sale data, we will develop a predictive model that estimates the selling price of cars based on various attributes. Additionally, the report will explore other valuable insights from the data to help the dealership make informed business decisions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as plotly
import plotly.express as px
plotly.offline.init_notebook_mode()
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
Getting the data¶
For this report, we will be using this dataset. Kaggle
This dataset covers 8 years of inventory in Canada. Each day their systems crawl and aggregate inventory from over 65k dealer websites to deliver the most comprehensive and up-to-date depictions of market activity available anywhere.
Individual listing records show year, make, model and trim, with VIN-level histories, showing the most recent time the car showed up online back to the earliest, with every change that occurred over that time.
Equipment breakdowns give fuel type, engine size, transmission, color, driveline and body style with links back to the VDP of the source we obtained the information from.
df = pd.read_csv('../data/ca-dealers-used.csv')
df.head(2)
C:\Users\Owner\AppData\Local\Temp\ipykernel_11404\18213597.py:1: DtypeWarning: Columns (13,15) have mixed types. Specify dtype option on import or set low_memory=False.
| id | vin | price | miles | stock_no | year | make | model | trim | body_type | ... | drivetrain | transmission | fuel_type | engine_size | engine_block | seller_name | street | city | state | zip | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b39ea795-eca9 | 19UNC1B01HY800062 | 179999.0 | 9966.0 | V-P4139 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | edmundston honda | 475 Rue Victoria | Edmundston | NB | E3V 2K7 |
| 1 | 026cb5b1-6e3e | 19UNC1B02HY800023 | 179995.0 | 5988.0 | PPAP70374 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | garage daniel lessard | 2795 Route-du-prsident-kennedy | Notre-dame-des-pins | QC | G0M 1K0 |
2 rows × 21 columns
As we can see from the above, the dataset has some issues in the 13th and 15th column. So, now we will reload the dataset by specifying the types for those columns.
dtype_spec = {
'fuel_type': 'object',
'engine_block': 'object'
}
df = pd.read_csv('../data/ca-dealers-used.csv', dtype=dtype_spec)
df.head(2)
| id | vin | price | miles | stock_no | year | make | model | trim | body_type | ... | drivetrain | transmission | fuel_type | engine_size | engine_block | seller_name | street | city | state | zip | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b39ea795-eca9 | 19UNC1B01HY800062 | 179999.0 | 9966.0 | V-P4139 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | edmundston honda | 475 Rue Victoria | Edmundston | NB | E3V 2K7 |
| 1 | 026cb5b1-6e3e | 19UNC1B02HY800023 | 179995.0 | 5988.0 | PPAP70374 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | garage daniel lessard | 2795 Route-du-prsident-kennedy | Notre-dame-des-pins | QC | G0M 1K0 |
2 rows × 21 columns
df.shape
(393603, 21)
We've sorted that out and we can see from the above that our dataset consists of 393,603 rows and 21 columns, which we will explore in the next section.
Exploratory Data Analysis¶
df.head(5)
| id | vin | price | miles | stock_no | year | make | model | trim | body_type | ... | drivetrain | transmission | fuel_type | engine_size | engine_block | seller_name | street | city | state | zip | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b39ea795-eca9 | 19UNC1B01HY800062 | 179999.0 | 9966.0 | V-P4139 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | edmundston honda | 475 Rue Victoria | Edmundston | NB | E3V 2K7 |
| 1 | 026cb5b1-6e3e | 19UNC1B02HY800023 | 179995.0 | 5988.0 | PPAP70374 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | garage daniel lessard | 2795 Route-du-prsident-kennedy | Notre-dame-des-pins | QC | G0M 1K0 |
| 2 | 5cd5d5b2-5cc2 | 19UNC1B02HY800071 | 168528.0 | 24242.0 | B21085 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | lougheed acura | 1388 Lougheed Highway | Coquitlam | BC | V3K 6S4 |
| 3 | b32473ed-5922 | 19UNC1B02LY800001 | 220000.0 | 6637.0 | AP5333 | 2020.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | drive autogroup | 1305 Parkway Suite 600 | Pickering | ON | L1V 3P2 |
| 4 | ac40c9fc-0676 | 19UNC1B02LY800001 | 220000.0 | 6637.0 | AP5333 | 2020.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | acura pickering | 575 Kingston Road | Pickering | ON | L1V 3N7 |
5 rows × 21 columns
Overview of the Dataset¶
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 393603 entries, 0 to 393602 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 393603 non-null object 1 vin 393603 non-null object 2 price 358486 non-null float64 3 miles 366590 non-null float64 4 stock_no 365910 non-null object 5 year 393586 non-null float64 6 make 393603 non-null object 7 model 388809 non-null object 8 trim 354824 non-null object 9 body_type 359578 non-null object 10 vehicle_type 355365 non-null object 11 drivetrain 354608 non-null object 12 transmission 357922 non-null object 13 fuel_type 322790 non-null object 14 engine_size 320950 non-null float64 15 engine_block 320439 non-null object 16 seller_name 391371 non-null object 17 street 385674 non-null object 18 city 385829 non-null object 19 state 385767 non-null object 20 zip 385834 non-null object dtypes: float64(4), object(17) memory usage: 63.1+ MB
From the above, we can see all the 21 columns that we have and what they represent.
df.describe()
| price | miles | year | engine_size | |
|---|---|---|---|---|
| count | 3.584860e+05 | 3.665900e+05 | 393586.000000 | 320950.000000 |
| mean | 2.601902e+04 | 7.566339e+04 | 2016.414829 | 2.785073 |
| std | 2.064007e+04 | 5.775442e+04 | 3.345400 | 1.236639 |
| min | 0.000000e+00 | 0.000000e+00 | 1981.000000 | 0.600000 |
| 25% | 1.490000e+04 | 3.491375e+04 | 2015.000000 | 2.000000 |
| 50% | 2.190000e+04 | 6.232800e+04 | 2017.000000 | 2.400000 |
| 75% | 3.199500e+04 | 1.032830e+05 | 2019.000000 | 3.500000 |
| max | 1.288888e+06 | 2.300033e+06 | 2022.000000 | 8.400000 |
Of the 21 columns, only 4 are numeric and the rest are non numeric.
df.isnull().sum()
id 0 vin 0 price 35117 miles 27013 stock_no 27693 year 17 make 0 model 4794 trim 38779 body_type 34025 vehicle_type 38238 drivetrain 38995 transmission 35681 fuel_type 70813 engine_size 72653 engine_block 73164 seller_name 2232 street 7929 city 7774 state 7836 zip 7769 dtype: int64
We have a lot of missing values in our dataset, looking at the price column for example, we have 35,117 missing values, which represents about 10% of our dataset. Because our dataset is large enough, we can afford to remove all the rows with no price value.
import plotly.express as px
# Calculate the percentage of missing values for each column
missing_values = df.isnull().mean() * 100
# Filter out columns with no missing values
missing_values = missing_values[missing_values > 0]
# Create a larger pie chart
fig = px.pie(values=missing_values, names=missing_values.index, title='Missing Values Percentage by Column', width=800, height=800)
fig.show()
We will decide what to do with the rest of the columns as we go on.
But for now, we will remove all the empty rows in the price column from the dataset.
df = df.dropna(subset=['price'])
df['price'].isnull().sum()
0
Let us look at our price column.
# visualize the distribution of the target variable (price) using plotly bar chart
fig = px.histogram(df, x='price', title='Price Distribution')
fig.show()
The reason why our graph is so right skewed is because of the presence of outliers in the column. Let us use the boxplot to examine this further.
fig = px.box(df, x='price', title='Price Distribution')
fig.show()
Looking at the distribution of the price, we can see that the prices of some cars are as high as $1.2M etc, unless these are some very luxurious cars, that was probably a mistake. Let us see what percentage of the cars are worth more than $100k.
df[df['price'] > 100000]
| id | vin | price | miles | stock_no | year | make | model | trim | body_type | ... | drivetrain | transmission | fuel_type | engine_size | engine_block | seller_name | street | city | state | zip | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b39ea795-eca9 | 19UNC1B01HY800062 | 179999.0 | 9966.0 | V-P4139 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | edmundston honda | 475 Rue Victoria | Edmundston | NB | E3V 2K7 |
| 1 | 026cb5b1-6e3e | 19UNC1B02HY800023 | 179995.0 | 5988.0 | PPAP70374 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | garage daniel lessard | 2795 Route-du-prsident-kennedy | Notre-dame-des-pins | QC | G0M 1K0 |
| 2 | 5cd5d5b2-5cc2 | 19UNC1B02HY800071 | 168528.0 | 24242.0 | B21085 | 2017.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | lougheed acura | 1388 Lougheed Highway | Coquitlam | BC | V3K 6S4 |
| 3 | b32473ed-5922 | 19UNC1B02LY800001 | 220000.0 | 6637.0 | AP5333 | 2020.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | drive autogroup | 1305 Parkway Suite 600 | Pickering | ON | L1V 3P2 |
| 4 | ac40c9fc-0676 | 19UNC1B02LY800001 | 220000.0 | 6637.0 | AP5333 | 2020.0 | Acura | NSX | Base | Coupe | ... | 4WD | Automatic | Electric / Premium Unleaded | 3.5 | V | acura pickering | 575 Kingston Road | Pickering | ON | L1V 3N7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 387560 | 75485435-0109 | ZPBUA1ZLXKLA03998 | 259000.0 | 10287.0 | 904888 | 2019.0 | Lamborghini | Urus | Base | SUV | ... | 4WD | Automatic | Premium Unleaded | 4.0 | V | hgrgoire | 625 Rue Dubois | Saint-eustache | QC | J7P 3W1 |
| 389789 | 4c9c3f8f-f6d3 | 4JGFB5KB8MA310456 | 100177.0 | 90.0 | 21028102 | 2021.0 | Mercedes-Benz | GLE | GLE450 | SUV | ... | 4WD | Automatic | Electric / Unleaded | 3.0 | I | mercedes-benz north vancouver | 1375 Marine Drive | North Vancouver | BC | V7P 3E5 |
| 393547 | 1b392ca8-b5c8 | W1KUG7DB1LA529608 | 120790.0 | 4995.0 | 220351 | 2020.0 | Mercedes-Benz | S-Class | S560e | Sedan | ... | RWD | Automatic | Electric / Unleaded | 3.0 | V | mercedes-benz sherbrooke | 4787, Boul. Bourque | Sherbrooke | QC | J1N 2G6 |
| 393601 | efa897c2-a52f | WP1BE2AY4LDA55072 | 115995.0 | NaN | NaN | 2020.0 | Porsche | Cayenne Coupe | E-Hybrid | SUV | ... | 4WD | Automatic | Electric / Unleaded | 3.0 | V | porsche centre winnipeg | 660 Pembina Highway | Winnipeg | MB | R3M 2M5 |
| 393602 | 479607ed-62af | WP1BE2AY8LDA55284 | 113530.0 | 7595.0 | P16107 | 2020.0 | Porsche | Cayenne Coupe | E-Hybrid | SUV | ... | 4WD | Automatic | Electric / Unleaded | 3.0 | V | h.j. pfaff motors | 33 Auto Park Cir | Woodbridge | ON | L4L 8R1 |
1840 rows × 21 columns
Perusing through the columns which prices more than 100k, I can see some Porsches and some Lamborghinis, which are nice, but for this report, being done for a local small business delearship, we will exclude these cars so as not to skew our dataset.
df.shape
(358486, 21)
df = df[df['price'] <= 100000]
df.shape
(356646, 21)
fig = px.histogram(df, x='price', title='Price Distribution')
fig.show()
fig = px.box(df, x='price', title='Price Distribution')
fig.show()
We can see that our prices are relatively normally distributed, with the median car price coming in at around $22k.
Let us take care the rest of the dataset.
df.isnull().sum()
id 0 vin 0 price 0 miles 22414 stock_no 23142 year 15 make 0 model 4182 trim 33175 body_type 30610 vehicle_type 34539 drivetrain 33155 transmission 30371 fuel_type 62233 engine_size 64009 engine_block 64353 seller_name 2042 street 6601 city 6564 state 6624 zip 6561 dtype: int64
numeric_df = df.select_dtypes(include=[np.number])
numeric_df.columns
Index(['price', 'miles', 'year', 'engine_size'], dtype='object')
non_numeric_df = df.select_dtypes(exclude=[np.number])
non_numeric_df.columns
Index(['id', 'vin', 'stock_no', 'make', 'model', 'trim', 'body_type',
'vehicle_type', 'drivetrain', 'transmission', 'fuel_type',
'engine_block', 'seller_name', 'street', 'city', 'state', 'zip'],
dtype='object')
# for all columns in the non-numeric dataframe, print the number of unique values
for col in non_numeric_df.columns:
display(f'{col}: {non_numeric_df[col].nunique()}')
'id: 356646'
'vin: 204955'
'stock_no: 175130'
'make: 55'
'model: 850'
'trim: 1462'
'body_type: 21'
'vehicle_type: 2'
'drivetrain: 3'
'transmission: 2'
'fuel_type: 24'
'engine_block: 3'
'seller_name: 4980'
'street: 4805'
'city: 758'
'state: 15'
'zip: 3523'
#no of empty rows in the non_numeric_df
non_numeric_df.isnull().sum()
id 0 vin 0 stock_no 23142 make 0 model 4182 trim 33175 body_type 30610 vehicle_type 34539 drivetrain 33155 transmission 30371 fuel_type 62233 engine_block 64353 seller_name 2042 street 6601 city 6564 state 6624 zip 6561 dtype: int64
For our categorical variables, we will replace all the missing values with a placeholder like 'missing'.
# for all columns in the non-numeric dataframe, replace missing values with 'Missing' in df
for col in non_numeric_df.columns:
df[col] = df[col].fillna('Missing')
# check if there are any missing values in the non-numeric dataframe
df[non_numeric_df.columns].isnull().sum()
id 0 vin 0 stock_no 0 make 0 model 0 trim 0 body_type 0 vehicle_type 0 drivetrain 0 transmission 0 fuel_type 0 engine_block 0 seller_name 0 street 0 city 0 state 0 zip 0 dtype: int64
For the numeric columns, we will replace the misisng values with the median of the respective columns.
numeric_df.isnull().sum()
price 0 miles 22414 year 15 engine_size 64009 dtype: int64
# for all columns in the numeric dataframe, replace missing values with the median value in df
for col in numeric_df.columns:
df[col] = df[col].fillna(df[col].median())
df.isnull().sum()
id 0 vin 0 price 0 miles 0 stock_no 0 year 0 make 0 model 0 trim 0 body_type 0 vehicle_type 0 drivetrain 0 transmission 0 fuel_type 0 engine_size 0 engine_block 0 seller_name 0 street 0 city 0 state 0 zip 0 dtype: int64
We have no more missing values in our dataset, now we can proceed with our analysis.
numeric_df = df.select_dtypes(include=[np.number])
non_numeric_df = df.select_dtypes(exclude=[np.number])
Correlation Analysis¶
import seaborn as sns
correlation_matrix = numeric_df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
We can see that the engine size and year the car was made has a relatively positive correlation with the price, while the miles has a negative correlation to the price.
What this hints at is that these features most likely have an effect on the price of a car.
# using plotly to show the scatter plot of the relationship between the price and all the columns in the numeric dataframe except price
for col in numeric_df.columns:
if col != 'price':
fig = px.scatter(df, x=col, y='price', title=f'{col} vs Price')
fig.show()